{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 建模前工作流"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本章包括以下主题：\n",
    "\n",
    "1. [从外部源获取样本数据](getting-sample-data-from-external-sources.html)\n",
    "1. [创建试验样本数据](creating-sample-data-for-toy-analysis.html)\n",
    "1. [把数据调整为标准正态分布](scaling-data-to-the-standard-normal.html)\n",
    "1. [用阈值创建二元特征](creating-binary-features-through-thresholding.html)\n",
    "1. [分类变量处理](working-with-categorical-variables.html)\n",
    "1. [标签特征二元化](binarizing-label-features.html)\n",
    "1. [处理缺失值](imputing-missing-values-through-various-strategies.html)\n",
    "1. [用管线命令处理多个步骤](using-pipelines-for-multiple-preprocessing-steps.html)\n",
    "1. [用主成分分析降维](reducing-dimensionality-with-pca.html)\n",
    "1. [用因子分析降维](using-factor-analytics-for-decomposition.html)\n",
    "1. [用核PCA实现非线性降维](kernel-pca-for-nonlinear-dimensionality-reduction.html)\n",
    "1. [用截断奇异值分解降维](using-truncated-svd-to-reduce-dimensionality.html)\n",
    "1. [用字典学习分解法分类](decomposition-to-classify-with-dictionarylearning.html)\n",
    "1. [用管线命令连接多个转换方法](putting-it-all-together-with-pipelines.html)\n",
    "1. [用正态随机过程处理回归](using-gaussian-processes-for-regression.html)\n",
    "1. [直接定义一个正态随机过程对象](defining-the-gaussian-process-object-directly.html)\n",
    "1. [用随机梯度下降处理回归](using-stochastic-gradient-descent-for-regression.html)\n",
    "\n",
    "<!-- TEASER_END -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 简介"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本章介绍数据获取（setting data），数据整理（preparing data）和建模前的降维（premodel dimensionality reduction）工作。这些内容并非机器学习（machine learning，ML）最核心的部分，但是它们往往决定模型的成败。\n",
    "\n",
    "本章主要分三部分。首先，我们介绍如何创建模拟数据（fake data），这看着微不足道，但是创建模拟数据并用模型进行拟合是模型测试的重要步骤。更重要的是，当我们从零开始一行一行代码实现一个算法时，我们想知道算法功能是否达到预期，这时手上可能没有数据，我们可以创建模拟数据来测试。之后，我们将介绍一些数据预处理变换的方法，包括缺失数据填补（data imputation），分类变量编码（categorical variable encoding）。最后，我们介绍一些降维方法，如主成分分析，因子分析，以及正态随机过程等。\n",
    "\n",
    "本章，尤其是前半部分与后面的章节衔接紧密。后面使用scikit-learn时，数据都源自本章内容。前两节介绍数据获取；紧接着介绍数据清洗。\n",
    "\n",
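本书使用">
    "> This book uses scikit-learn 0.15, NumPy 1.9, and pandas 0.13, and is compatible with Python 2.7 and Python 3.4. Other Python libraries are also used; refer to each library's official installation instructions."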
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}